RNA-seq Review

  • RNA-seq is the high-throughput sequencing of mRNA transcripts
  • mRNA is converted to cDNA, amplified through PCR, and sequenced
  • Amplified fragments are reads; the total number of reads in a sample is its library size
  • Transcripts can be quantified(not aligned) against a transcriptome reference
  • Significant change in gene expression between experimental conditions is known as differntial gene expression

Beyond Bulk RNA-seq

  • RNA seq was typically performed on pooled RNA from many individual cells, because a large amount of initial RNA reduced the number of PCR cycles required for amplification
  • Better polymerases and sequencing technologies now allow for the sequencing of individual cells

Why study single cells

  • Many biological specimens, tissue, tumors, etc contains heterogeneous population of cells, with different cell types often having distinct function.
  • Before single cell RNA-seq, transcriptomic studies of individual cell types required flow sorting out a population of pure cells, only possible for well studied cells with distinct markers
  • Single cell RNA-seq allows for transcriptomic quantification of individual cells with the need for markers

Initial barriers

  • Early single cell technologies involved diluting cells into wells, and carrying out sequencing reactions within each well. These were costly and not super quantitative, as the low amount of RNA required increased amplification, and created a lot of noise, in the for of PCR duplicates

Technological improvements - U Mi’s

include_graphics('../src/umi.png')

  • Unique molecular identifiers(UMIs) solve the PCR duplication problem.
  • a UMI is typically a 7-8bp artificial nucleotide sequence
  • at 8bp this length there are 4^8 or 65536 different sequences that are possible of this length
  • After reverse transcription and fragmentation, and before amplification, all these UMI’s are added to individual cells, with the idea that each fragment is now labeled with a distinct identifier
  • when these fragments are amplified, sequenced and mapped back to the genome, instead of counting reads, many of which may be PCR duplicates, you count the number of UMIs

Technological improvements - Barcodes

  • the next challenge was adding labellers for each cell. This is typically done with molecular barcodes, similar to UMIs
  • To each well where you have a cell, add a distinct nucleotide sequence, so that all reads associated with an individual cell have a distinct barcode associated to that cell. barcodes are typically 10-12 bp, so 4^12 is 16777216 distinct nucleotide sequences.
  • These individual cells can now be pooled and sequenced as one big assay

Well based single cell RNA-seq

  • First generation of single cell RNA-seq (scRNA-seq)
  • individual cells are pipetted into wells(through serial dilution), where each well has reagents for RNA-seq library prep, barcodes, umi’s etc
  • cells are the lysed and sequenced.
  • this typically yields 100-1000 individual cells cells
  • Functions essentially the same as bulk RNA-seq, in that whole transcripts are amplified, and distinct isoforms of genes can be quantified.

Droplet based single cell RNA-seq

  • second generation of single cell RNA-seq (scRNA-seq)
  • individual cells are encapsulated in oil coated beads through a microfluidics system(loading phase)
  • number of droplet far exceeds the number of cells
  • RNA-seq library preparation occurs within these droplets

Droplet based single cell RNA-seq

  • only the 3’ end of transcripts are sequenced, so can only get gene level expression for about 2000 genes per cell
  • cells are the lysed and sequenced.
  • this typically yields 4000-8000 individual cells per assay. Studies often often use several assays, so its not uncommon to see studies with 50k-100k cells

Analysis of single cell RNA-seq

  • Analysis of single cell data is still in the “wild-west” phase of development, where there as many methods for single cell analysis published are there are studies using single cell RNA-seq
  • single cell RNA-seq presents some major computational challenges, as analysis of a droplet based sequencing matrix will give a a matrix of about 2000 rows and 50k columns, which would take about 3.5 gigabytes to store in RAM
  • Designing computational methods to work with this kind of data is difficult.

Major players for scRNA-seq analysis

  • Seurat - an R package for scRNA-seq analysis. Actively being developed, but maturing. Well documented, and works very well for analyses under ~200K cells.
  • Scanpy - a python package fo scRNA-seq, very similar to Seurat, and works well with large amounts of data (1 million+ cells ).
  • We’ll be working with Seurat

typical scRNA-seq analysis - Alignment

  • alignment encompasses alignment of reads to transcriptome, UMI quantification, and barcode identification.
  • 10x Genomics( the company who cells droplet based scRNA-seq) provides their own tool called cellranger for this.
  • It’s takes a VERY long time to run, and so is unusable for studies with 100k+ cells
  • kallisto + bustools - an opensource alternative thats MUCH faster.
  • It works very similar to Salmon, which we used yesterday.
  • I’m not going to cover how to use it today, as it takes a lot of time to run, and the authors have fantastic documentations on their website.

Working with SparseMatrices in R

  • I’ll be working with the Peripheral Blood Mononuclear Cells (PBMC) dataset, which is generated from a human blood sample
  • single cell data is usually loaded as a SparseMatrix - this type of matrix saves space by not storing any value equal to 0, and so for data with a lot of 0’s you end up saving a lot of space
library(dplyr)
## Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
## when loading 'dplyr'
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Seurat)
## Warning: package 'Seurat' was built under R version 3.6.2
## This version of Shiny is designed to work with 'htmlwidgets' >= 1.5.
##     Please upgrade via install.packages('htmlwidgets').
pbmc.data <- Read10X(data.dir = "~/Downloads/filtered_gene_bc_matrices/hg19/")
head(pbmc.data)
## [1] 0 0 0 0 0 0
  • each row here corresponds to a gene, and each column is an individual cell, marked by the barcode for that specific cell

Seurat Objects

  • Seurat expects has a custom object it uses to store data. the object stores, raw counts, metadata, transformation, and dimensionality reductions.
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc
## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

Seurat Objects

  • Seurat objects store data in slots. You can access slots with the @ operator
  • this is the metadata for this object, where each row corresponds to a cell
pbmc@meta.data %>% head()
##                  orig.ident nCount_RNA nFeature_RNA
## AAACATACAACCAC-1     pbmc3k       2419          779
## AAACATTGAGCTAC-1     pbmc3k       4903         1352
## AAACATTGATCAGC-1     pbmc3k       3147         1129
## AAACCGTGCTTCCG-1     pbmc3k       2639          960
## AAACCGTGTATGCG-1     pbmc3k        980          521
## AAACGCACTGGTAC-1     pbmc3k       2163          781

Seurat Objects

  • to get the names of all slots of an object, use slotNames
slotNames(pbmc)
##  [1] "assays"       "meta.data"    "active.assay" "active.ident" "graphs"      
##  [6] "neighbors"    "reductions"   "images"       "project.name" "misc"        
## [11] "version"      "commands"     "tools"

Seurat Assays

  • Seurat has the functionality of storing multiple count matrices, which can be used for certain types of scRNA-seq methods,
  • each distinct count matrices is stored in the assay slot. Assay is a list, with each name corresponding to a specific count matrix
pbmc@assays
## $RNA
## Assay data with 13714 features for 2700 cells
## First 10 features:
##  AL627309.1, AP006222.2, RP11-206L10.2, RP11-206L10.9, LINC00115, NOC2L,
## KLHL17, PLEKHN1, RP11-54O7.17, HES4

Seurat Assays

  • the individual objects within the Assay slot are called Assay objects (computer scientists are not known for their naming skills )
class(pbmc@assays$RNA)
## [1] "Assay"
## attr(,"package")
## [1] "Seurat"
  • Assay objects are their own class, and have their own slots
slotNames(pbmc@assays$RNA)
## [1] "counts"        "data"          "scale.data"    "key"          
## [5] "assay.orig"    "var.features"  "meta.features" "misc"
  • one of the criticisms of Seurat is this complicated object structure, which is especially challenging for beginners.

Quality Control

  • There are two important QC steps right off the bat:
  • Removing cells with high mitochondrial RNA expression
  • Removing Doublets

High Mitochondrial Gene expression

  • High Mitochondrial Gene expression is a sign of a dead cell
  • as cells die, mitochondria burst and release mtRNA
  • We don’t want to measure dead cells, and so its good practice to remove cells that have over 10% of total gene expression attributed to mtRNA
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# pattern corresponds to the prefix for mtGenes

High Mitochondrial Gene expression

  • this adds percent.mt as a column to pbmc@meta.data
pbmc@meta.data %>% head
##                  orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
## AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
## AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
## AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
## AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898
## AAACGCACTGGTAC-1     pbmc3k       2163          781  1.6643551

High Mitochondrial Gene expression

  • we can visualize this via a violin plot
VlnPlot(pbmc, features = "percent.mt")

Removing High Mitochondrial genes

  • meta.data is a data.frame with rownames, so to use tidyverse functions we needs to add a column
pbmc[['barcode']] <- rownames(pbmc@meta.data)
cells_pass_mt_qc <- pbmc@meta.data %>% filter(percent.mt < 10 ) %>% pull(barcode)
#remembers that cells are columns here 
pbmc <- pbmc[, cells_pass_mt_qc]

Doublet

  • Occasionally during the loading phase, two cells will enter the same oil droplet
  • this goes against the idea of “single cells”, and so we need to remove them.
  • Unfortunately, its still an on going problem that people are working on
  • Most tools identify doublets after the clustering stage of the analysis, which we’ll talk about later.

Count Transformation

  • Just like bulk RNA-seq we need to normalize the count data.
  • there’s still a lot of debate about how to best normalize scRNA-seq data
  • for now we’ll stick with the standard log transformation
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

Identifying variable features

  • In order to speed up calculations, we want to select genes that show the most difference between cells
  • These are genes that are most variable across the population of cells
pbmc <- FindVariableFeatures(pbmc, 
                             assay = 'RNA', 
                             selection.method = "vst", 
                             nfeatures = 2000)

Identifying variable features

  • this is stored in the var.features slot of the assay we selected when running FindVariableFeatures
pbmc@assays$RNA@var.features %>% head
## [1] "PPBP"   "S100A9" "LYZ"    "IGLL5"  "GNLY"   "FTL"
VariableFeaturePlot(pbmc)

Center scaling the data

  • for each gene subtract the mean from it and divide by the variance. This sets the mean to 0 and the variance to 1 for all genes. This is required for some downstream analysis like clustering
pbmc <- ScaleData(pbmc)
## Centering and scaling data matrix

Dimensionality Reduction - PCA

  • To save on compute time we reduce dimensions with PCA
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
  • dimensions are stored in the reductions slot
pbmc@reductions
## $pca
## A dimensional reduction object with key PC_ 
##  Number of dimensions: 50 
##  Projected dimensional reduction calculated:  FALSE 
##  Jackstraw run: FALSE 
##  Computed using assay: RNA

Dimensionality Reduction - PCA

  • We can visualize this with DimPlot
DimPlot(pbmc, reduction = 'pca')

Selecting Principal Components

  • PCA returns multiple principle components(PCs),
pbmc@reductions$pca@cell.embeddings %>% dim
## [1] 2694   50
  • Each represents an axis of variation for a different set of genes.
  • The choice of which PC’s to include for down stream analysis is a very important step

Selecting Principal Components

  • we can visualize the genes captured by each PC with the DimHeatmap function
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)

Selecting Principal Components

  • another way is the elbow plot
ElbowPlot(pbmc)

Clustering

  • Clustering is one of the core steps of scRNA-seq analysis
  • Using our reduced dimensions, want to find subpopulations within our set of data that correspond to distinct cell types, which we do by clustering
  • In its simplest form, clustering calculates the distance between all data points in a data set, and says data points that are less than X distance apart are one cluster.
  • Seurat uses the Louvain clustering algorithm for clustering
  • the resolution parameter corresponds to the number of distinct clusters to make choose values between 0.4-1.2
pbmc <- FindNeighbors(pbmc, dims = 1:10)# Remember we are only using the first 10 PC's
## Computing nearest neighbor graph
## Computing SNN
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2694
## Number of edges: 97655
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8729
## Number of communities: 10
## Elapsed time: 0 seconds

Nonlinear dimensionality reduction

  • A drawback of PCA is that it only can only linear decompose variance
include_graphics('../src/pca_nl.png')

Nonlinear dimensionality reduction

  • Other dimensionality reduction like t-stochastic neighbor embedding(tSNE) or Uniform Manifold Approximation and Projection(UMAP)
  • UMAP has become the standard
pbmc <- RunUMAP(pbmc, dims = 1:10)# only first 10
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 08:55:57 UMAP embedding parameters a = 0.9922 b = 1.112
## 08:55:57 Read 2694 rows and found 10 numeric columns
## 08:55:57 Using Annoy for neighbor search, n_neighbors = 30
## 08:55:57 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 08:55:57 Writing NN index file to temp file /var/folders/43/bxszl5k54ml3ljc8rqdcc7vchdgq7y/T//RtmpIgf5oX/file5cad14949d1e
## 08:55:57 Searching Annoy index using 1 thread, search_k = 3000
## 08:55:57 Annoy recall = 100%
## 08:55:58 Commencing smooth kNN distance calibration using 1 thread
## 08:55:58 Initializing from normalized Laplacian + noise
## 08:55:58 Commencing optimization for 500 epochs, with 107586 positive edges
## 08:56:01 Optimization finished

Nonlinear dimensionality reduction

  • this is our umap plot
DimPlot(pbmc, reduction = "umap", group.by = 'orig.ident')

  • Here’s PCA for reference
DimPlot(pbmc, reduction = 'pca',  group.by = 'orig.ident')

Nonlinear dimensionality reduction

  • this become even more apparent when we add our cluster labels
DimPlot(pbmc, reduction = "umap", group.by ='seurat_clusters' )

  • pca
DimPlot(pbmc, reduction = "pca", group.by ='seurat_clusters')

identifying cluster markers

  • the next key step is identifying markers for each cluster. these are the genes that are representative of only 1 cluster.
  • These genes can then be matched back to the literature for known cell types, or for annotating a new one
  • min.pct mean the minimum fraction of cells within a cluster a gene must be expressed in, and logfc.threshold defines
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
head(pbmc.markers)
##               p_val avg_logFC pct.1 pct.2     p_val_adj cluster  gene
## RPS6  6.065761e-118 0.4498788     1 0.994 8.318585e-114       0  RPS6
## RPS12 3.885735e-117 0.4786291     1 0.991 5.328897e-113       0 RPS12
## RPS27 2.800750e-115 0.4780629     1 0.992 3.840949e-111       0 RPS27
## RPS14 1.815983e-111 0.4074665     1 0.994 2.490439e-107       0 RPS14
## RPL32 5.886010e-111 0.4023982     1 0.995 8.072074e-107       0 RPL32
## RPS25 4.713038e-103 0.4931592     1 0.975  6.463461e-99       0 RPS25

identifying cluster markers

  • We can explore these with some plots
VlnPlot(pbmc, features = c("MS4A1", "CD79A"), group.by ='seurat_clusters' )

identifying cluster markers

  • ridgeplot
RidgePlot(pbmc, features = c("NKG7", "PF4"),group.by ='seurat_clusters' )
## Picking joint bandwidth of 0.142
## Picking joint bandwidth of 0.0287

identifying cluster markers

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", 
    "CD8A"))

cluster heatmap

  • the very standard cluster heatmap
top10 <- pbmc.markers %>% group_by(cluster) %>% 
  top_n(n = 10, wt = avg_logFC) # select top 10 markers for each gene
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

Automatic annotation

  • the package singleR can compare your list of cluster markers to published data sets like tabula muris or the human cell atlas

Beyond single cell RNA-seq

  • Newer iterations of scRNA-seq allow you to capture more information

Single Nucleus sequencing

  • Very similar to standard scRNA seq, but useful for tissues that are difficult to dissociate into single cells

RNA + Protein

  • Quantify gene expression and identify proteins
  • published protocols REAP-seq and CITE-seq

Spatial Transcriptomics

  • RNA-barcoded probes are fixed to slide
  • tissue is then mounted on slide and permeablized,and RNA attached to probe